Learning Patterns from the Web to Translate Named Entities for Cross Language Information Retrieval
Authors
Abstract
Named entity (NE) translation plays an important role in many applications. In this paper, we focus on translating NEs from Korean to Chinese to improve Korean-Chinese cross-language information retrieval (KCIR). The ideographic nature of Chinese makes NE translation difficult because one syllable may map to several Chinese characters. We propose a hybrid NE translation system. First, we integrate two online databases to extend the coverage of our bilingual dictionaries: we use Wikipedia as a translation tool based on the inter-language links between the Korean edition and the Chinese or English editions, and we use Naver.com’s people search engine to find a query name’s Chinese or English translation. The second component learns Korean-Chinese (K-C), Korean-English (K-E), and English-Chinese (E-C) translation patterns from the web. These patterns can be used to extract K-C, K-E, and E-C pairs from Google snippets. We found that KCIR performance using this hybrid configuration was over five times better than that of a dictionary-based configuration using only Naver people search. Mean average precision was as high as 0.3385 and recall reached 0.7578. Our method can handle Chinese, Japanese, Korean, and non-CJK NE translation and substantially improves KCIR performance.
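The two-stage idea in the abstract (dictionary lookup first, then pattern-based extraction from search snippets) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the pattern templates, their weights, the `<K>`/`<C>` slot notation, and the function names are hypothetical and are not taken from the paper, which learns its patterns from the web rather than hard-coding them.

```python
import re
from collections import Counter

# Hypothetical surface patterns with illustrative weights (not from the paper).
# <K> marks the Korean query name, <C> the Chinese translation candidate.
PATTERNS = {
    "<K>(<C>)": 1.0,
    "<K>, <C>": 0.5,
}

# One or more characters from the CJK Unified Ideographs block.
CJK_IDEOGRAPHS = r"([\u4e00-\u9fff]+)"


def compile_pattern(template, korean_name):
    """Turn a template into a regex for one query, escaping literal text."""
    regex = ""
    for part in re.split(r"(<K>|<C>)", template):
        if part == "<K>":
            regex += re.escape(korean_name)
        elif part == "<C>":
            regex += CJK_IDEOGRAPHS
        else:
            regex += re.escape(part)
    return re.compile(regex)


def translate(korean_name, dictionary, snippets):
    """Dictionary lookup first; fall back to pattern matching over snippets."""
    if korean_name in dictionary:
        return dictionary[korean_name]
    scores = Counter()
    for template, weight in PATTERNS.items():
        regex = compile_pattern(template, korean_name)
        for snippet in snippets:
            for candidate in regex.findall(snippet):
                scores[candidate] += weight
    if not scores:
        return None
    # Return the candidate with the highest accumulated pattern weight.
    return scores.most_common(1)[0][0]
```

In this sketch, `dictionary` stands in for the merged Wikipedia and Naver lookups and `snippets` for the text returned by a web search; how those are obtained is left out on purpose.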
Similar resources
Soundex-based Translation Correction in Urdu–English Cross-Language Information Retrieval
Cross-language information retrieval is difficult for languages with few processing tools or resources, such as Urdu. An easy way of translating content words is provided by Google Translate, but due to lexicon limitations named entities (NEs) are transliterated letter by letter. The resulting NE errors (zynydyny zdn for Zinedine Zidane) hurt retrieval. We propose to replace English non-words ...
Generating Patterns for Extracting Chinese-Korean Named Entity Translations from the Web
One of the main difficulties in Chinese-Korean cross-language information retrieval is to translate named entities (NEs) in queries. Unlike common words, most NEs are not found in bilingual dictionaries. This paper presents a pattern-based method of finding NE translations online. The most important feature of our system is that patterns are generated and weighed automatically, saving considera...
Cross-Reading News
Journalists often need to perform multiple actions, using different tools, in order to create content for publication. This involves searching the web, curating the result list, choosing relevant entities for the article and writing. We aim to improve this pipeline through CrossReading News, a modular and extendable web application aimed at helping journalists easily research and draft article...
A High-Accurate Chinese-English NE Backward Translation System Combining Both Lexical Information and Web Statistics
Named entity translation is indispensable in cross-language information retrieval. We propose an approach that combines lexical information, web statistics, and inverse search based on Google to backward translate a Chinese named entity (NE) into English. Our system achieves a high Top-1 accuracy of 87.6%, which is relatively good performance compared with results reported in this area to date.
A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features
Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three popular methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrids of the two. Machine-learning-based methods have good performance in the Persian language if they are trained with good features. To get good performanc...
Journal title:
Volume, issue
Pages -
Publication date: 2008